Data Source Summary

This document provides a summary of various data sources used in our analysis, along with references to the R code that handles data reading and cleaning.


1. Temperature Station

1.1 Description:

We are examining daily climate data from temperature stations across Canada, focusing on five key stations that are most representative of the entire province of British Columbia. The dataset includes the following columns:

x: Longitude

y: Latitude

LOCAL_DATE: Date in the format year-month-day TOTAL_PRECIPITATION: Total precipitation in mm

STATION_NAME: Name of the station

MEAN_TEMPERATURE: Mean of daily maximum and minimum temperatures

MAX_TEMPERATURE: Daily maximum temperature

MIN_TEMPERATURE: Daily minimum temperature

TOTAL_RAIN: Total rainfall

MIN_REL_HUMIDITY: Minimum relative humidity

LOCAL_YEAR: Year

LOCAL_MONTH: Month

Our goal is to identify stations with data extending back to 1941 in order to compare the heatwave of 1941 with that of 2021.



1.2 Data Access



1.3 Weather Station Location & Year Range

YVR

year range: 1937-2024

Station Name Start Year End Year Station ID Start Date End Date
VANCOUVER INTL A 1937 2024 1108447 1937-01-01 2024-08-01
VANCOUVER INTL A 1937 2024 1108395 2013-06-13 2024-08-01
location information:



Abbotsford

year range: 1935-2024

table:

Station Name Start Year End Year Station ID Start Date End Date
ABBOTSFORD A 2012 2024 1100032 2012-06-21 2024-07-30
ABBOTSFORD A 1944 2024 1100030 1944-10-01 2012-06-20
ABBOTSFORD UPPER SUMAS 1935 1946 1100040 1935-11-01 1946-03-31
location information:



Prince_George

year range: 1940-2024

table:

Station Name Start Year End Year Station ID Start Date End Date
PRINCE GEORGE 1942 2009 1096450 1942-07-01 2009-10-21
PRINCE GEORGE 2009 2024 1096439 2009-10-22 2024-07-30
PRINCE GEORGE 1912 1945 1096436 1912-08-01 1945-06-30
location information :



#### FortNelson

year range: 1937-2024

table:

Station Name Start Year End Year Station ID Start Date End Date
FORT NELSON A 1937 2012 1192940 1937-09-01 2012-11-14
FORT NELSON A 2012 2024 1192946 2012-11-08 2024-07-30
location information :



Kelowna

year range: 1899-2024

table:

Station Name Start Year End Year Station ID Start Date End Date
KELOWNA 1899 1962 1123930 1899-03-01 1962-09-30
KELOWNA 1961 1969 1123975 1961-08-01 1969-11-30
KELOWNA A 1968 2005 1123970 1968-10-01 2005-09-30
KELOWNA 2005 2024 1123939 2005-09-03 2024-07-30
KELOWNA UBCO 2013 2024 1123996 2013-12-16 2024-07-30
location information :

  • It is important to note that total precipitation data from 1123939 station (years 2009 to 2024) may be incomplete.

  • To ensure data integrity, particularly for calculating the maximum consecutive dry day, missing precipitation values have been supplemented with data from a nearby secondary UBC-O station.

  • Since treating missing data as zero would significantly distort the calculation of consecutive dry days.

  • The original Kelowna dataset contained 191 missing values, while the UBC-O dataset provided 343 complete data points. After supplementation, the Kelowna dataset now contains 132 missing values, indicating that 59 missing values were filled. within 2013-2024 period.

1.4 Achieved

n the early stages of our analysis, we explored data from the following stations:

  • Kamloops: 1939-2024

  • Penticton: 1941-2024

For analyzing the relationship between temperature and field crop yield data, we picked the station from the Peach River region:

  • FortStJohn: 1910-2024



1.5 R code for data reading & cleaning

  • The data is read using the process_and_save_data() function, located in ../climate_extreme_RA/R/read_data_(#accrodingly station name).R

    • Within the process_and_save_data() function, deal_with_non_exist_date() detects missing dates (i.e., dates that should be present based on the expected continuous time series but are absent). The function addresses these missing rows by adding rows for these dates and filling in NA for the other value columns.


    • Note: When there is an overlap in the date ranges of different station data, I manually(before running process_and_save_data() to read the data ) remove the rows with overlapping dates from the older station data and retain the corresponding rows from the newer station data.

      For example, given two date ranges:

      • Station A: October 1, 1944, to June 20, 2012

      • Station B: November 1, 1935, to March 31, 1946

      I will preserve the data from Station B for the period from November 1, 1935, to September 30, 1944, and the data from Station A for the period from October 1, 1944, to June 20, 2012.This adjusted data then becomes the original raw data.

  • The raw data path is specified at the beginning of each station’s analysis R Markdown file, located in ../climate_extreme_RA/reports_station/(# According Station Name)_heatwave_analysis.Rmd


2 ERA5

2.1 Description:

  • For monthly data
    • Dataset Name: “ERA5-Land monthly averaged data from 1950 to present
    • Year Range: January 1950-2024
    • Update frequency: Monthly with a delay of 2-3 months relatively to the actual date.
    • Data Frequency: Monthly
    • Variables: a large number of atmospheric, ocean-wave and land-surface quantities.
    • Geography: globally, 0.1°×0.1° (9km×9km) gridded
  • For hourly data
    • Dataset Name: “ERA5-Land hourly data from 1950 to present
    • Year Range: January 1950-2024
    • Update frequency: Monthly with a delay of 2-3 months relatively to the actual date.
    • Data Frequency: Hourly
    • Variables: a large number of atmospheric, ocean-wave and land-surface quantities.
    • Geography: globally, 0.1°×0.1° (9km×9km) gridded
  • We utilize the ERA5-Land dataset rather than ERA because the script package (which will be shown in section 2.3) is specifically compatible with ERA5-Land.
  • Since pre-aggregated daily data does not exist, we must download the hourly data and manually aggregate into daily data.
  • The ERA5 dataset has been widely utilized in numerous studies and publications, including notable examples such as Anderson and Radić (2023) and Malinina and Gillett (2024).

2.2 Data Access:

  • There are two ways to obtain the dataset:
    • Through the website directly.
    • By using a python library’s script, which automatically makes requests to the website.
2.2.1 Get Monthly Data from Website Form:
  • Step1: find the dataset
    • Link to data source: ERA5 Data

    • If the link does not work, the data can be accessed through the official website of the Copernicus Climate Change Service by following the navigation path outlined below:

    • Home > Data Set(in the search bar, search for “ERA5-Land monthly averaged data from 1950 to present”) > ERA5 Data

  • Step2:

    • The “Overview” option provides a detailed explanation of each variable.

    • To download data set, click on the “Download data” option

  • Step3: select options and download data

    • Then select product type and variable as below:

    • Next, select the desired year and month, and ensure that the “time” is set to 00:00.

    • For the area, YVR approximately spans from North 49.5 to South 48.4, and from West -124.0 to East -122.0. (We’ll cover how to accurately define the region using a visualization code package in section 2.2.4.) The image below offers a helpful reference of latitude and longitude range.

    • Then select the NetCDF format, then click the “Submit Form”

  • Step 4: check the download process

    • click “Your requests” tab at the very top.

      • Please note that the “My Request” section will not appear between the “Applications” and “Toolbox” sections, as you are not registered or logged in to the website yet

      • Once you create an account, you will be able to view your download requests on the website. Additionally, any API calls you execute through scripts will be reflected in this section.

    • If you click on the triangle next to the product name, you can view more details about this specific download request.

    • Once the request is completed, the data will be available for download via the green “Download” button.

    • The file will be in .nc (NetCDF) format.

2.2.2 Get Hourly Data from Website Form:
  • Step1: find the dataset
    • Link to data source: ERA5 Data Hourly

    • If the link does not work, the data can be accessed through the official website of the Copernicus Climate Change Service by following the navigation path outlined below:

    • Home > Data Set(in the search bar, search for “ERA5-Land hourly data from 1950 to present”) > ERA5Data

  • Step 2:
    • The “Overview” option provides a detailed explanation of each variable.

    • To download data set, click on the “Download data” option

  • Step3: select options and download data
    • then select variable as below:

    • Next, select the desired year, month, day, and hour.

      • Keep in mind that ERA5 uses the UTC time zone for hours.

      • For more detailed information, refer to the “Quality assessment” section besides the “Download data”, and click on the user guidance: link ERA5-Land hourly data from 1950 to present  A new CDS soon to be launched - expect some disruptions and watch this page for latestt?. Thank you.  Overview  Download data  Quality assessment  Documentation  This is a new feature, work in progress. Should any inconsistency be found, please report to https://support.ecmwf.int?  The CDS datasets are assessed by the Evaluation and Quality Control (EQC) function of C3S independently of the data supplier. EQC  encompasses a framework of processes aimed to assure technical and scientific quality harmonized across all dataset types available through  the CDS. During the EQC process, the documentation provided with the dataset is scrutinized and data are checked for usability and reliability.  Variable:  2m temperature  • Variable: 2m temperature  INTRODUCTION  Dataset overview  Temporal and spatial coverage and resolution  Providers  Dataset version  Data update  User guide  USER DOCUMENTATION  User guide  Scientific methodology  Uncertainty quantification  Validation  Inter-comparison  ACCESS  Toolbox compatibility  Archive  Overview of input data and methods, general guidelines for the data usage, etc  Is there a user guide?  Link to user guide  Is there a user forum provided for the  dataset?  Yes  Yes  r_? Last updated on 01/07/2021  INDEPENDENT ASSESSMENT  Data check  Expert evaluation  Dataset maturity  Key strengths and limitations  Last update on 01/07/2021  entation

    • Please note that you can only select one year at a time, which is why in section 2.3, we use a script for batch processing to retrieve data from 1950 to the present. Otherwise, you would need to manually click through many selections.

    • For the area, YVR approximately spans from North 49.5 to South 48.4, and from West -124.0 to East -122.0. (We’ll cover how to accurately define the region using a visualization code package in section 2.2.4.) The image below offers a helpful reference of latitude and longitude range.

    • Then select the NetCDF format, then click the Submit Form

  • Step 4: check the download process
    • click “Your requests” tab at the very top.

      • Please note that the “My Request” section will not appear between the “Applications” and “Toolbox” sections, as you are not registered or logged in to the website yet

      • Once you create an account, you will be able to view your download requests on the website. Additionally, any API calls you execute through scripts will be reflected in this section.

    • If you click on the triangle next to the product name, you can view more details about this specific download request.

    • Once the request is completed, the data will be available for download via the green “Download” button.

    • The file will be in .nc (NetCDF) format.

2.3 Script for Hourly Data

To obtain daily data, we must download the hourly data and manually aggregate into daily data.

To do this efficiently across multiple years, we use the era5cli, a command line interface to download ERA5 data from the website (in Section 2.2.1 or 2.2.2), which allows us to download batches of hourly data.

The era5cli library is recommended in the Malinina and Gillett (2024), specifically in the Data Availability section, where it states:

“ERA5 data was retrieved from the Copernicus Climate Change Service (C3S) Climate Data Store (CDS) (Hersbach et al., 2020a) using the era5cli software.”

The steps for using this script are outlined below:

Step 1: Introduction

The era5cli library is introduced in the following CLI usage.

This library enables users to send download requests to the two websites (monthly and hourly data) via code. The generated API call simulates a request made directly through the website. For more details, please check Dataset overview.

Therefore, once logged in to the website (in Section 2.2.1 or 2.2.2), you can view and manage these download requests in the “Your Request” section.

When requesting hourly data for multiple years, the script automatically generates multiple requests simultaneously in the website.

Step 2: Install the Script

Ensure that the script and all necessary dependencies are installed.

https://era5cli.readthedocs.io/en/stable/getting_started/

This document has a clear step by step guidance of how to install the related package.

Step 3: Understand the Script

Review the script to understand its functionality and how it interacts with the era5cli library.

You could see the detailed explanation from how to Formulating requests.

For each of the request, we can choose different type of variables / parameters, the overall explanation is under Argument overview (argument to pick area, time etc.) and Variable overview (to select type of climate variables).

And in our cases, we will choose variables from the ERA5-Land, which includes 2m_temperature,total_precipitation and 2m_dewpoint_temperature (dewpoint related to humidity)

Step 4: Use the Script

Execute the script to begin downloading the hourly data.

In our case, the script is

era5cli hourly 
--variables 2m_temperature 
--land
--startyear 1940 
--endyear 2024 
--months 4 5 6 7 8 9 10 
--hours 0 1 18 19 20 21 22 23 
--area 49.5 -124.0 48.4 -122.0 
--merge

when we put the script in any terminal, it could be like this:

era5cli hourly --variables 2m_temperature --land --startyear 1940 --endyear 2024 --months 4 5 6 7 8 9 10 --hours 0 1 18 19 20 21 22 23 --area 49.5 -124.0 48.4 -122.0 --merge

We want to study the summer temperature, the hours is from 11am to 6pm in BC, Canada, thus should be 18pm - 1am in UTC time zone.

--merge means we merge all year in one file (Merge yearly output files. Default is split output files into separate files for every year)

--land means we use the ERA5-land data.

  • recommandation: sepeare the year to be for instance one or two decade window to aviod massive download time and any sudden interrupt.

Step 5: Check the Download

After the script has run, check the “Your Request” section on the website to verify that the download requests have been successfully generated and are in progress.


2.4 Python code for read NetCDF and visualization of Location Selection

  • After downloading hourly climate data for each year using script in Section 2.3,I gathered all NetCDF data into a single ZIP file located at ../climate_extreme_RA/R/era5/hourly_yvr.zip. This ZIP file contains the hourly data for all years.

  • Next, the NetCDF files are read using a Python script found in the jupyter notebook ../climate_extreme_RA/R/era5/ERA5 hourly data.ipynb. The notebook is divided into three sections: data download, exploratory data analysis (EDA), and a sample check for location data.

  • In the Download Section

    • the clean_and_wrangle_df() function is utilized to clean the data and adjust the time zone from UTC to Vancouver time. This is done using the following command: df['time'] = df['time'].dt.tz_localize('UTC').dt.tz_convert('America/Vancouver')

    • Finally, the cleaned data is then saved as a CSV file in the directory ../climate_extreme_RA/data/era5_YVR.csv

  • In the Sample Check for Location Section, Plotly Express is used to visualize the temperature data for geographical locations within Vancouver. The following code snippet generates an interactive map:

import plotly.express as px

# Specify a basic Mapbox style
mapbox_style = "carto-positron"

# Create a scatter plot on a map with colors based on temperature values
fig = px.scatter_mapbox(df, lat='latitude', lon='longitude', hover_name='t2m', zoom=10,color='t2m',  # Use temperature values to determine colors
mapbox_style=mapbox_style)

# Add a title to the plot
fig.update_layout(title='Van Temperature Distribution by Location')
fig.show()

This plot allows users to zoom in and out, enabling a closer examination of temperature distributions across different grid points, corresponding to specific latitude and longitude ranges.

And the data can be filtered to isolate the exact grid points of interest

2.5 R code for data reading & wrangling

After converting the NetCDF data to CSV format using the Python notebook, the data is then read into R for further analysis.

  • The R script for this process is located at ../climate_extreme_RA/R/era5/YVR_read_wrangling_data.R.

  • The data focus on a specific grid point that closely matches the temperature station at YVR in Vancouver, BC. The following R code illustrates this process:

#Define file paths
file_paths <- c("../data/era5_YVR.csv"
                )
# Function to read and select necessary columns
read_and_select <- function(file_path) {
  read.csv(file_path) 
  #%>%
   # select(all_of(needed_columns))
}

# Read and combine all datasets
era5_yvr <- map_dfr(file_paths, read_and_select)

# Display the first few rows of the combined data
head(era5_yvr)

ordered_era5_yvr <- era5_yvr %>%
  arrange(date)
  • The final dataset, named ordered_era5_yvr, is structured by selecting the grid point that most accurately represents the temperature station YVR, ensuring the data is well-organized for subsequent analysis.

3 STATCanada Potato data


3.1 Description:

  • Dataset ID: “Table: 32-10-0358-01 (formerly CANSIM 001-0014)”
  • Year Range: 1910-2024
  • Frequency: Annual
  • Variables: Avg Yield, unit is Hundredweight per harvested acre
  • Geography: Canada, Province-wide: British Columbia (BC); Other provinces
  • Rationale for choosing potatoes: According to (Sonnewald et al. 2015), potatoes are highly vulnerable to high temperatures, which negatively impact tuber development, storage, and seed potato fitness. To investigate the potential correlation between temperature extremes, including heat waves, and long-term yield patterns in British Columbia, a comprehensive yield dataset is required.


3.2 Data Access

Link to data source: Statistics Canada

If the link does not work, the data can be accessed through the official website of the Statistics Canada by following the navigation path outlined below:

Home > Data (in the search bar, search for “Area, production and farm value of potatoes” or Dataset ID: 32-10-0358-01) > Area, production and farm value of potatoes


3.3 Select and download data in the web

  • Step1: click the Add/Remove button


  • Step2: the pic below shows the layout of column filter option

    • Step2.1 Select the Geography:

      • Click on the Geography tab.

      • Don’t Check the box next to “Canada” as it will include nationwide data for the entire country.

      • Expand the list by clicking the “+” symbol next to “Canada.” and select the desired provinces: British Columbia (BC).


    • Step2.2 Choose the Variables:

      • Navigate to the Area, production and farm value of potatoes tab.

      • From the list of available variables, select only Average yield, potatoes


    • Step2.3 Click on the Reference period tab:

      • Select the years, ranging from 1910 to 2024.
    • Step2.4 Customize the Layout:

      • Adjust the layout settings as preferred.

      • Note that these settings will only affect the display of the dataset on the website and will not influence the structure of the data when it is downloaded.


  • Step 3: Download the dataset

    • First Click on the “Download options” Button, then from the list of available download formats, choose the “Download selected data (for database loading).”

3.4 R code for data reading & wrangling

  • The data is read within ../climate_extreme_RA/R_agricultural/read_data.R
  • The final data is called data_pot
# File paths for Potato data 
file_paths <- c("../data/agri/Potato_Data.csv") 
# Define the columns needed,UOM is unit, VALUE is yield 
needed_columns <- c("REF_DATE", "VALUE","UOM") 
# Function to read and select necessary columns, and add crop type 
read_and_select_pot <- function(file_path) { 
  read_csv(file_path) %>% select(all_of(needed_columns)) } 
# Read and combine all datasets 
data_pot <- map_dfr(file_paths, read_and_select_pot) 
#create new column called crop type 
data_pot$Crop_Type <- "Potato"
  • ../climate_extreme_RA/R_agricultural/model_potato.R uses a Moving Average to Detrend Time Series yield Data(transfer from Blue line to red line)

  • Then, the functions within the file are employed as follows: lm_monthly_potato and lm_season_potatoare used to fit potato yield which is annually yield across British Columbia, VS EHF (utilizing daily maximum EHF from Kelowna as independent variables, aggregated into monthly or seasonal maximums for each year) through linear regression.

  • Additionally,lm_onemonth_potatois used to fit 12 single-month EHF versus yield linear regressions.

3.5 Conclusion

  • The adjusted R-squared values are low for all linear models, and the correlation, as observed through scatter plots,varies and is difficult to discern.

  • Upon examining the daily EHF curves from the 40 highest and 40 lowest yield years, no clear pattern distinguishes the curves between high and low yield years.

  • This detailed analysis can be found in the report located at ../climate_extreme_RA/reports_station/Kelowna_heatwave_analysis.Rmd and ../climate_extreme_RA/report_agricultural/agri_lm_model.Rmd in Part 2: patato data

4 STATCanada Fruits data

4.1 Description:

  • Dataset ID: “Table: 32-10-0364-01 (formerly CANSIM 001-0009)”

  • Year Range: 1926-2024

  • Frequency: Annual

  • Variables:

    • Marketed production from 1926-2024, unit in ton

    • Total Cultivated area from 2002- 2024, unit hectares

  • Geography: Canada, Province-wide: British Columbia (BC); Other provinces

To investigate the potential correlation between temperature extremes, including heat waves, and long-term yield patterns in British Columbia, a comprehensive yield dataset is required.

Therefore, it is necessary to estimate the total cultivated area for years prior to 2002 to cover a longer period, as marketed production data is available from 1926 to 2024.


4.2 Data Access

Link to data source: Statistics Canada

If the link does not work, the data can be accessed through the official website of the Statistics Canada by following the navigation path outlined below:

Home > Data (in the search bar, search for “Area, production and farm gate value of marketed fruits” or Dataset ID: 32-10-0364-01) > Area, production and farm value of marketed fruits


4.3 Select and download data in the web

  • Step1: click the Add/Remove button


  • Step2: the pic below shows the layout of column filter option

    • Step2.1 Select the Geography:

      • Click on the Geography tab.

      • Don’t check the box next to “Canada” as it will include nationwide data for the entire country.

      • Expand the list by clicking the “+” symbol next to “Canada.” and select the desired provinces: British Columbia (BC).


    • Step2.2 Choose the Variables:

      • Navigate to the Estimates tab.

      • From the list of available variables, select Marketed production and Cultivated area, total


    • Step2.3 Choose the Commodity tab:

      • Please check the box next to the type of fruit you want.

      • Important: To view all types of fruits, ensure you enter a sufficiently large number in the “Show XXX Members” tab; otherwise, some fruit types may be hidden.

    • Step2.4 Click on the Farm area, production, value tab:

      • click the according unit matches the Estimates we selected.
    • Step2.5 Click on the Reference period tab:

      • Select the years, ranging from 1910 to 2024.
    • Step2.6 Customize the Layout:

    • Adjust the layout settings as preferred.

    • Note that these settings will only affect the display of the dataset on the website and will not influence the structure of the data when it is downloaded.


  • Step 3: Download the dataset

    • First Click on the “Download options” Button, then from the list of available download formats, choose the “Download selected data (for database loading).”

4.3 R code for data reading & wrangling

  • The data is read within ../climate_extreme_RA/R_agricultural/read_data.R
  • The final data is called data_statcan_fruit

# File paths for Canola, Barley, and Fruit data
file_paths <- c("../data/agri/Fruit_prod.csv",
                "../data/agri/Fruit_area.csv")

# Define the columns needed
#unit is ton
needed_columns <- c("REF_DATE", "VALUE", "Commodity","Estimates","UOM")

# Function to read and select necessary columns, and add crop type
read_and_select_fru <- function(file_path) {
  read_csv(file_path) %>%
    select(all_of(needed_columns))%>% 
    rename(Crop_Type = Commodity)
}

# Read and combine all datasets
data_statcan_fruit <- map_dfr(file_paths, read_and_select_fru)



# Convert the data types
data_statcan_fruit$REF_DATE <- as.integer(data_statcan_fruit$REF_DATE)
data_statcan_fruit$VALUE <- as.numeric(data_statcan_fruit$VALUE)

#check num of missing
sum(is.na(data_statcan_fruit))
  • Within the ../climate_extreme_RA/report_agricultural/FAOSTAT.Rmd

    In Part 4: New Crop Data, Section 4.2: Fruits Data, we present:

    1. A line plot illustrating all fruit yield data from 2002 to 2024.

    2. A line plot for marketed production (in tons) and total cultivated area (in hectares).

    3. A line plot showing the total cultivated area (in hectares), with the overall mean value applied for years prior to 2002.

4.4 Conclusion

  • Using the mean for past years’ area data (before the red dash line) is not ideal, as demonstrated by the yield data for apples and grapes shown below.

  • We could analyze the yield data from the past 20 years in relation to EHF to identify any potential patterns. This approach may provide insights into the correlation between extreme heat events and yield fluctuations.

  • However, for a more comprehensive analysis, additional area data is necessary. A

5 STATCanada Vegetable data

5.1 Description:

  • Dataset ID: “Table: 32-10-0365-01 (formerly CANSIM 001-00013)”

  • Year Range: 1940-2024

  • Frequency: Annual

  • Variables:

    • Average yield per hectare (kilograms)

    • Average yield per acre (pounds)

      Comment: yield data is from 1940-2017


    • Area planted (acres)

    • Area planted (hectares)

    • Area harvested (hectares)

    • Area harvested (acres)


    • Marketed production (tons)

    • Marketed production (metric tonnes)

    • Total production (tons)

    • Total production (metric tonnes)

  • Geography: Canada, Province-wide: British Columbia (BC); Other provinces

To investigate the potential correlation between temperature extremes, including heat waves, and long-term yield patterns in British Columbia, a comprehensive yield dataset is required.

Therefore, it is necessary to estimate the yield data after 2017 to 2024.


5.2 Data Access

Link to data source: StatisticsCanada

If the link does not work, the data can be accessed through the official website of the Statistics Canada by following the navigation path outlined below:

Home > Data (in the search bar, search for “Area, production and farm gate value of marketed vegetables” or Dataset ID: 32-10-0365-01) > Area, production and farm value of marketed vegetables


5.3 Select and download data in the web

  • Step1: click the Add/Remove button


  • Step2: the pic below shows the layout of column filter option

    • Step2.1 Select the Geography:

      • Click on the Geography tab.

      • Don’t check the box next to “Canada” as it will include nationwide data for the entire country.

      • Expand the list by clicking the “+” symbol next to “Canada.” and select the desired provinces: British Columbia (BC).


    • Step2.2 Choose the Variables:

      • Navigate to the Estimates tab.

      • From the list of available variables, select all except Farm gate value (dollars)


    • Step2.3 Choose the Commodity tab:

      • Please check the box next to the type of fruit you want.

      • Important: To view all types of fruits, ensure you enter a sufficiently large number in the “Show XXX Members” tab; otherwise, some fruit types may be hidden.

    • Step2.4 Click on the Reference period tab:

      • Select the years, ranging from 1910 to 2024.
    • Step2.5 Customize the Layout:

    • Adjust the layout settings as preferred.

    • Note that these settings will only affect the display of the dataset on the website and will not influence the structure of the data when it is downloaded.


  • Step 3: Download the dataset

    • First Click on the “Download options” Button, then from the list of available download formats, choose the “Download selected data (for database loading).”

5.4 R code for data reading & wrangling

  • The data is read within ../climate_extreme_RA/R_agricultural/read_data.R
  • The final data is called data_veg_1940
# Read the CSV file
file_paths <- c("../data/agri/Vegetable_Data.csv")
# Define the columns needed
needed_columns <- c("REF_DATE", "Estimates","VALUE","Commodity","UOM")
read_and_select <- function(file_path) {
  read.csv(file_path) %>%
    select(all_of(needed_columns))%>% 
    rename(Crop_Type = Commodity)
}
# Read and combine all datasets
data_veg <- map_dfr(file_paths, read_and_select)

#na
sum(is.na(data_veg$VALUE))
# Impute missing VALUEs with the mean of neighboring VALUEs
data_veg <- data_veg %>%
  group_by(Crop_Type,Estimates) %>%
  mutate(VALUE = na.approx(VALUE, rule = 2))
sum(is.na(data_veg$VALUE))

# Clean the Crop_Type column by removing "Fresh" and the numbers
data_veg$Crop_Type <- gsub("Fresh ", "", data_veg$Crop_Type)
data_veg$Crop_Type <- gsub("\\s*\\[.*\\]", "", data_veg$Crop_Type)

#NUM of crop type
length(unique(data_veg$Crop_Type))
#filter some veg out, keep only long term data
data_veg_1940 <- data_veg %>% 
  filter(Crop_Type != "broccoli" & Crop_Type != "Brussels sprouts" & Crop_Type != "eggplants (except Chinese eggplants)" & 
           Crop_Type != "French shallots and green onions" &  Crop_Type != "garlic" &  
           Crop_Type != "Other fresh fine herbs" &  Crop_Type != "Other fresh melons" &  
           Crop_Type != "Other fresh vegetables" &  Crop_Type != "parsley" &  
           Crop_Type != "peppers" &  Crop_Type != "pumpkins" &  Crop_Type != "radishes" &  
           Crop_Type != "squash and zucchini" &  Crop_Type != "sweet potatoes" &  
           Crop_Type != "Total fresh vegetables" &  Crop_Type != "watermelons" &  
           Crop_Type != "rhubarb"  &
           Crop_Type != "leeks" &  Crop_Type != "tomatoes" &  Crop_Type != "cucumbers and fresh gherkins (all varieties)")
length(unique(data_veg_1940$Crop_Type))


veg_year_2<- data_veg_1940 %>%
  group_by(Crop_Type,Estimates) %>%
  summarize(Start_Year = min(REF_DATE), End_Year = max(REF_DATE))
  • In the script located at ../climate_extreme_RA/R_agricultural/explore_qualified_crop.R, we are tasked with calculating and populating the average yield values for the years 2017 to 2024 (since it is missing from the original yield data) through the calculate_yields().

    • This involves handling two types of production data and two types of area data, each in different units.

    • We aim to calculate the yields for all combinations and subsequently compare these yields with the original yield data from 1940 to 2017.

    • Based on this comparison, we determine that:Marketed production (tons transfer to kg) and Area harvested (hectares) should be utilized.

  • Upon selecting the appropriate columns and calculating the yields for 2017 to 2024, the next step involves analysis the results within ../climate_extreme_RA/report_agricultural/agri_lm_model.Rmd.

    • In Part 1: Vegetation Data, a comparative analysis of the calculated yields versus the original historical yields is presented.

    • Crops will be classified into five groups based on the quality of the simulated yield.

      • Good quality will be defined as a perfect match between the calculated yield and the historical yield

      • While poor quality will be indicated by the presence of outliers or significant discrepancies between the two sets of yield data.

5.5 Conclusion

  • Research is required to determine which types of fruits are highly influenced by heat waves or cold waves. Additionally, data must be adjusted to reflect the full agricultural cycle year for each crop.

  • For instance, if the 2021 yield data for Crop A corresponds to planting in March 2021 and harvesting in October 2021, the climate data for the year 2021 will be used, as the entire crop cycle falls within the same calendar year.

  • However, if the 2021 yield data pertains to crops planted in November or December 2020 and harvested in May 2021, in this case, the “year” for climate data should reflect the period from the planting season in late 2020 through the harvesting season in 2021.

6 STATCanada Field Crop data

6.1 Description:

  • Dataset ID: “Table: 32-10-0002-01 (formerly CANSIM 001-0071)”

  • Year Range: 1990-2024

  • Frequency: Annual

  • Variables:

    • Average yield per hectare (kilograms)

    • Average yield per acre (pounds)

  • Geography: Canada, Province-wide: British Columbia (BC); Other provinces


6.2 Data Access

Link to data source: StatisticsCanada

If the link does not work, the data can be accessed through the official website of the Statistics Canada by following the navigation path outlined below:

Home > Data (in the search bar, search for “Estimated areas, yield and production of principal field crops by Small Area Data Regions, in metric and imperial units” or Dataset ID: 32-10-0002-01) > Estimated areas, yield and production of principal field crops by Small Area Data Regions, in metric and imperial units


6.3 Select and download data in the web

  • Step1: click the Add/Remove button


  • Step2: the pic below shows the layout of column filter option

    • Step2.1 Select the Geography:

      • Click on the Geography tab.

      • Check the box next to “British Columbi”

      • Expand the list by clicking the “+” symbol next to “Canada.” and select the rest region.

      • we picked the station FortStJohn: 1910-2024 from the Peach River region for analysis.


    • Step2.2 Choose the Variables:

      • Navigate to the Harvest disposition tab.

      • From the list of available variables, select Averave yield.


    • Step2.3 Choose the Type of crop tab:

      • Please check the box next to Oat, Barley, Peas,dry, Rye, fall remaining, Wheat, spring and Canola as other crop is missing for BC.

      • Important: To view all types of fruits, ensure you enter a sufficiently large number in the “Show XXX Members” tab; otherwise, some fruit types may be hidden.

    • Step2.4 Click on the Reference period tab:

      • Select the years, ranging from 1990 to 2024.
    • Step2.5 Customize the Layout:

    • Adjust the layout settings as preferred.

    • Note that these settings will only affect the display of the dataset on the website and will not influence the structure of the data when it is downloaded.


  • Step 3: Download the dataset

    • First Click on the “Download options” Button, then from the list of available download formats, choose the “Download selected data (for database loading).”

6.4 R code for data reading & wrangling

  • The data is read within ../climate_extreme_RA/R_agricultural/read_data.R
  • The final data is called data_statcan_crop
# Read the CSV file
file_paths <- c("../data/agri/FieldCropsEstimates.csv")

# Define the columns needed
needed_columns <- c("REF_DATE", "Harvest.disposition","VALUE",'GEO',"Type.of.crop","UOM")

# Function to read and select necessary columns, and add crop type
read_and_select <- function(file_path) {
  read.csv(file_path) %>%
    select(all_of(needed_columns)) %>%
    #rename(Production = VALUE) %>%
    rename(Crop_Type = Type.of.crop)
}

# Read and combine all datasets
data_statcan_crop <- map_dfr(file_paths, read_and_select)


#filter Harvest.disposition to be Average yield (kilograms per hectare)
crop_yield <- data_statcan_crop %>% filter(Harvest.disposition == "Average yield (kilograms per hectare)")
total_produ <- data_statcan_crop %>% filter(Harvest.disposition == "Production (metric tonnes)")

#rename
crop_yield <- crop_yield %>% rename(yield = VALUE)
total_produ <- total_produ %>% rename(Production = VALUE)
# Impute missing values with the mean of neighboring values
crop_yield <- crop_yield %>%
  group_by(Crop_Type,GEO) %>%
  mutate(yield = na.approx(yield, rule = 2))

total_produ <- total_produ %>%
  group_by(Crop_Type,GEO) %>%
  mutate(Production = na.approx(Production, rule = 2))

6.5 Conclusion

  • The dataset only contains 30 years of data. However, there is specific regional data available, and further analysis can be conducted using temperature data from stations such as Fort St. John in the Peace River region.

Reference

Anderson, Sam, and Valentina Radić. 2023. “Modeling the Streamflow Response to Heatwaves Across Glacierized Basins in Southwestern Canada.” Water Resources Research 59 (12): e2023WR035428. https://doi.org/https://doi.org/10.1029/2023WR035428.
Malinina, Elizaveta, and Nathan P. Gillett. 2024. “The 2021 Heatwave Was Less Rare in Western Canada Than Previously Thought.” Weather and Climate Extremes 43: 100642. https://doi.org/https://doi.org/10.1016/j.wace.2024.100642.
Sonnewald, Sophia, Jessica van Harsselaar, Kirsten Ott, Julia Lorenz, and Uwe Sonnewald. 2015. “How Potato Plants Take the Heat?” Procedia Environmental Sciences 29: 97. https://doi.org/https://doi.org/10.1016/j.proenv.2015.07.178.